Abstract

We propose and study a novel stochastic inertial primal-dual approach to solve composite
optimization problems. These problems arise naturally when learning with penalized
regularization schemes. Our analysis provides convergence results in a general setting, which
allows a variety of special cases of interest to be analyzed in a unified framework. Key in our
analysis is considering the framework of splitting algorithms for solving monotone inclusions
in suitable product spaces and for a specific choice of preconditioning operators.


1 Introduction 


Incorporating prior information about the problem at hand is key to learning from complex high
dimensional data. In a variational regularization framework, a learning solution is found by solving a
composite optimization problem, given by an error term and a suitable regularizer [3l]. It is the
design of this latter term that allows the available prior information to be incorporated. Indeed, this
observation has recently led to the study of vast families of regularizers O EH] •

From an optimization perspective, the problem arises of devising strategies to solve optimization 
problems induced by general regularizers (and error terms). While such problems might in general 
be non smooth, the composite structure (the functional to be minimized is a sum of terms composed 
with linear operators) can be exploited considering splitting techniques [H |25]. In particular, 
first order primal-dual methods have been recently applied to a variety of machine learning and
signal processing problems, and shown to provide state of the art results in large scale composite 
optimization problems um- Interestingly, the convergence of most of these methods can be 
analyzed within a common framework. Indeed, many different algorithms can be seen as instances
of a splitting approach for solving so-called monotone inclusions in suitable product spaces and
for a specific choice of preconditioning operators. Taking this perspective, a unified convergence
analysis can be established in a Hilbert space setting. The price paid for this generality is that
rates of convergence cannot be obtained [1].

In this paper, we are interested in developing stochastic extensions of inertial primal-dual
approaches for composite optimization. This question is of interest when only an uncertain/partial
knowledge of the functional to be minimized [TH] is available, but also to consider randomized
approaches to deterministic optimization problems. While a few recent studies deal with
the analysis of stochastic primal dual methods in the learning setting for specific problems IMIE], 
we are not aware of any study of the general stochastic and inertial versions of the primal-dual 
methods proposed in this paper. Our main result is a convergence theorem for inertial stochastic 
forward-backward splitting algorithms with preconditioning. 

This point of view directly yields, as corollaries, convergence results for a wide class
of optimization methods, some of them already known and used, and some of them new. In
particular, in the proposed methods, stochastic estimates of the gradient of the smooth components
are allowed, and both the proximity operators of the involved regularization terms and the involved
linear operators are activated independently and without inversions. From a technical point of view,
our analysis has three main features: 1) we consider convergence of the iterates in a Hilbert space
(there is no analogue of function values in the general setting); 2) the step-size is
bounded from below; this condition naturally leads to more stable implementations, since
vanishing step-sizes create numerical instabilities, but it requires a vanishing condition on
the stochastic errors; 3) we consider an inertial step, which in minimization problems leads to better
convergence rates [5].

The rest of the paper is organized as follows. In Section 2 we describe the setting and some
possible choices of regularization terms. Moreover, we show how the need to study monotone
inclusions naturally arises starting from minimization problems. In Section 3 we introduce the
stochastic inertial forward-backward algorithm with preconditioning and state its convergence properties.
The derivation of the novel primal-dual schemes, and the comparison with existing methods, can
be found in Section 4. Finally, in Section 5 we discuss the results of some numerical simulations.
The proofs of our statements are deferred to the Appendix.


2 Setting 


We consider the generalized learning model. Let $\Xi$ be a measurable space and assume there is a
probability measure $\rho$ on $\Xi$. Let $N$ be a strictly positive integer. The measure $\rho$ is fixed but known
only through a training set $(\xi_i)_{1\le i\le N} \in \Xi^N$ of samples i.i.d. with respect to $\rho$. Consider a hypothesis
space $\mathcal{H}$, a bounded positive self-adjoint linear operator $V\colon \mathcal{H} \to \mathcal{H}$, and a loss function
$\ell\colon \Xi \times \mathcal{H} \to [0,+\infty[$. Suppose that $\ell$ has a Lipschitz continuous partial gradient with respect to its
second argument, in the sense that there exists $\beta > 0$ such that, for every $\xi \in \Xi$ and for every
$(w_1,w_2) \in \mathcal{H}^2$,

\[ \|\nabla \ell(\xi,w_1) - \nabla \ell(\xi,w_2)\| \le (1/\beta)\,\|w_1 - w_2\|. \tag{2.1} \]
Let $f\colon \mathcal{H} \to \mathbb{R}$ be convex and lower semicontinuous. For every $j \in \{1,\dots,s\}$, let $\mathcal{G}_j$ be a Hilbert
space, let $g_j\colon \mathcal{G}_j \to [0,+\infty]$ be a convex and lower semicontinuous function, and let $D_j\colon \mathcal{H} \to \mathcal{G}_j$
be a linear and bounded operator. A key problem in this context is

\[ \underset{w\in\mathcal{H}}{\text{minimize}}\;\; \mathsf{E}[\ell(\xi,w)] + f(w) + \sum_{j=1}^{s} g_j(D_j w), \tag{2.2} \]

where the expectation can be taken either with respect to $\rho$ or with respect to the uniform measure on
the training set. In the first case we obtain the regularized learning problem, and in the latter case
we get the regularized empirical risk minimization problem, since for every $w \in \mathcal{H}$,

\[ \mathsf{E}[\ell(\xi,w)] = \frac{1}{N}\sum_{i=1}^{N} \ell(\xi_i,w). \tag{2.3} \]

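Supervised learning problems correspond to the case where $\Xi = \mathcal{X} \times \mathcal{Y}$, the training set is
$(\xi_i)_{1\le i\le N} = (x_i,y_i)_{1\le i\le N} \in (\mathcal{X} \times \mathcal{Y})^N$, $\mathcal{H}$ is a reproducing kernel Hilbert space of functions, and, for
every $((x,y),w) \in \Xi \times \mathcal{H}$, $\ell(x,y,w) = L(y,w(x))$ for some loss function $L\colon \mathcal{Y} \times \mathcal{Y} \to [0,+\infty[$.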


The algorithms studied in this paper can be used to directly solve the regularized expected loss
minimization problem (2.2) or to solve the regularized empirical risk minimization problem.
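The link between the two problems can be illustrated numerically. The following is a minimal sketch, assuming the square loss and synthetic data (the names `square_loss_grad` and `empirical_risk_grad` are hypothetical and not taken from the paper): averaging per-sample gradients over the uniform measure on the training set reproduces the gradient of the empirical risk, so the gradient at a uniformly sampled index is an unbiased estimate of it.

```python
import numpy as np

def square_loss_grad(w, x, y):
    """Gradient in w of the per-sample loss (<w, x> - y)^2."""
    return 2.0 * (x @ w - y) * x

def empirical_risk_grad(w, X, Y):
    """Gradient of the empirical risk (1/N) * sum_i (<w, x_i> - y_i)^2."""
    return 2.0 * X.T @ (X @ w - Y) / len(Y)

# Synthetic training set (illustrative assumption, not the paper's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)
w = rng.normal(size=4)

# Expectation with respect to the uniform measure on the training set.
mean_of_sample_grads = np.mean(
    [square_loss_grad(w, X[i], Y[i]) for i in range(len(Y))], axis=0
)
full_grad = empirical_risk_grad(w, X, Y)
```

In this sketch `mean_of_sample_grads` and `full_grad` agree up to rounding, which is the finite-sample counterpart of identity (2.3) applied to gradients.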


The term $g_j \circ D_j$ can be seen as a regularizer/penalty encoding some prior information about
the learning problem. Examples of convex, non-differentiable penalties include sparsity inducing
penalties such as the $\ell_1$ norm, as well as more complex structured sparsity penalties [2^1^.


2.1 Structured sparsity 


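Consider the empirical risk corresponding to a linear regression problem on $\mathbb{R}^d$ with the square loss
function, for a given training set $(x_i,y_i)_{1\le i\le N} \in (\mathbb{R}^d \times \mathbb{R})^N$,

\[ \underset{w\in\mathbb{R}^d}{\text{minimize}}\;\; \frac{1}{N}\sum_{i=1}^{N} (\langle w, x_i\rangle - y_i)^2 + f(w) + \sum_{j=1}^{s} g_j(D_j w). \tag{2.4} \]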


Several well-known regularization strategies used in machine learning can be written as in (2.4), for
suitable convex and lower semicontinuous functions $f\colon \mathbb{R}^d \to [0,+\infty[$ and $g_j$, and linear operators
$D_j$. For example, fused lasso regularization corresponds to $f = \|\cdot\|_1$ and, for every $j \in \{1,\dots,d-1\}$,
$g_j\colon \mathbb{R} \to \mathbb{R}$, $g_j = |\cdot|$, composed with $D_j\colon \mathbb{R}^d \to \mathbb{R}$, $D_j w = w_{j+1} - w_j$ [35]. In the case
of group sparsity, we assume that a collection $\{G_1,\dots,G_s\}$ of subsets of $\{1,\dots,d\}$ is given such that
$\bigcup_{j=1}^{s} G_j = \{1,\dots,d\}$. A popular regularization term is $\ell_1/\ell_q$ regularization, for $q \in [1,+\infty]$. This
can be obtained in our framework by choosing

\[ f = 0, \qquad g_j = d_j \|\cdot\|_q, \]

with $\|\cdot\|_q$ the $\ell_q$ norm, $D_j$ the canonical projection on the subspace $\{w \in \mathbb{R}^d : w_k = 0 \;\forall k \notin G_j\}$,
and $(d_j)_{1\le j\le s}$ a vector of weights. Various grouped norms, such as graph lasso or hierarchical
group lasso penalties, can be recovered by choosing the groups $G_1,\dots,G_s$ appropriately [3]. The
OSCAR penalty [7], which can be used as a regularizer when it is known that the components of the
unknown signal exhibit structured sparsity but a group structure is not known a priori, can be
included in our model. More precisely, it is possible to set $f(w) = \lambda_1\|w\|_1 + \lambda_2 \sum_{i<j} \max\{|w_i|, |w_j|\}$.
This leads to proximal splitting methods such as those proposed in [39]. Note that this approach
would require the computation of the proximity operator of $f$, which is not straightforward. An
alternative approach is to set $f = \lambda_1\|\cdot\|_1$ and, for every $(i,j) \in \{1,\dots,d\}^2$ with $i < j$, define
$D_{ij}\colon \mathbb{R}^d \to \mathbb{R}^2$, acting as $D_{ij} w = (w_i,w_j)$, and $g_{ij}\colon \mathbb{R}^2 \to [0,+\infty[$, such that $g_{ij}(u) = \lambda_2\|u\|_\infty$. With
this choice, the algorithms developed in this work can be used to derive stochastic primal-dual
proximal splitting methods, which differ from the ones treated in [39] and are novel also in the
deterministic case. In particular, they require only the computation of the proximity operator of
the conjugate of the function $g_{ij}$, which is the projection onto an $\ell_1$ ball in $\mathbb{R}^2$. Latent group lasso
formulations and, more generally, structured sparsity penalties defined as infimal convolutions
[23, 37], can also be treated with analogous definitions of $g_j$ and $D_j$. We also mention that multiple
kernel learning problems are included in our framework [25], [3].
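The operators $D_j$ in these examples are simple to write down explicitly. The sketch below is illustrative only: the helper names are hypothetical, and the weight $\lambda_2$ is placed inside $g_{ij}$ as an assumption. It builds the fused lasso difference operator and checks that the OSCAR penalty coincides with its composite form $f(w) + \sum_{i<j} g_{ij}(D_{ij} w)$.

```python
import itertools
import numpy as np

def fused_lasso_operator(d):
    """(d-1) x d matrix whose j-th row maps w to w[j+1] - w[j]."""
    D = np.zeros((d - 1, d))
    for j in range(d - 1):
        D[j, j], D[j, j + 1] = -1.0, 1.0
    return D

def oscar_direct(w, lam1, lam2):
    """lam1 * ||w||_1 + lam2 * sum_{i<j} max(|w_i|, |w_j|)."""
    pairs = itertools.combinations(range(len(w)), 2)
    return lam1 * np.abs(w).sum() + lam2 * sum(max(abs(w[i]), abs(w[j])) for i, j in pairs)

def oscar_composite(w, lam1, lam2):
    """Same penalty written as f(w) + sum_{i<j} g_ij(D_ij w), with D_ij w = (w_i, w_j)."""
    total = lam1 * np.abs(w).sum()           # f = lam1 * ||.||_1
    for i, j in itertools.combinations(range(len(w)), 2):
        u = np.array([w[i], w[j]])           # D_ij w
        total += lam2 * np.abs(u).max()      # g_ij(u) = lam2 * ||u||_inf (assumed scaling)
    return total

w = np.array([1.0, -2.0, 0.5, 3.0])
D = fused_lasso_operator(len(w))
fused_penalty = np.abs(w).sum() + np.abs(D @ w).sum()   # ||w||_1 + sum_j |D_j w|
oscar_a = oscar_direct(w, 1.0, 0.5)
oscar_b = oscar_composite(w, 1.0, 0.5)
```

The composite form is what the primal-dual schemes exploit: only $D_{ij}$, its adjoint, and the proximity operator of the conjugate of $g_{ij}$ are needed, never the proximity operator of the full penalty.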


2.2 From Problem (2.2) to monotone inclusions 


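Set

\[ F = \mathsf{E}[\ell(\xi,\cdot)]. \]

The primal-dual methods proposed in this paper are based on the idea that problem (2.2) can be
formulated as a saddle point problem

\[ \min_{w\in\mathcal{H}} \; \sup_{(v_1,\dots,v_s)\in\mathcal{G}_1\times\dots\times\mathcal{G}_s} \; F(w) + f(w) + \sum_{j=1}^{s} \big(\langle D_j^* v_j \mid w\rangle - g_j^*(v_j)\big). \tag{2.5} \]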


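If strong duality holds, then [3, Proposition 19.18(v)] implies that every solution
$(w,v_1,\dots,v_s) \in \mathcal{H} \times \mathcal{G}_1 \times \dots \times \mathcal{G}_s$ of (2.5) satisfies

\[ \begin{cases} 0 \in \nabla F(w) + \partial f(w) + \sum_{j=1}^{s} D_j^* v_j \\ 0 \in -D_j w + \partial g_j^*(v_j) \quad \forall j \in \{1,\dots,s\}. \end{cases} \tag{2.6} \]

We denote by $\mathcal{P} \times \mathcal{D}$ the set of solutions of (2.6). In (2.5), $\cdot^*$ denotes the adjoint of a linear
operator and $g_j^*$ the conjugate of the function $g_j$ (see e.g. [3] for the definition). Let us define
$\mathcal{G} = \mathcal{G}_1 \times \dots \times \mathcal{G}_s$, let $D\colon \mathcal{H} \to \mathcal{G}$, $(\forall w \in \mathcal{H})$ $Dw = (D_1 w,\dots,D_s w)$, and $g\colon \mathcal{G} \to [0,+\infty]$,
$(\forall v = (v_1,\dots,v_s))$ $g(v) = \sum_{j=1}^{s} g_j(v_j)$. Then we can rewrite the inclusion in (2.6) in a more compact
form in the space $\mathcal{H} \times \mathcal{G}$, as

\[ (0,0) \in (\nabla F(w),0) + (\partial f(w) + D^* v,\; -Dw + \partial g^*(v)). \tag{2.7} \]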

The previous formulation leads to the study of a more general class of problems, which retain the
same key properties of the operators in (2.7).

Problem 2.1 Let $\mathcal{K}$ be a Hilbert space, let $A\colon \mathcal{K} \to 2^{\mathcal{K}}$ be a maximally monotone (multivalued)
operator, and let $B\colon \mathcal{K} \to \mathcal{K}$ be $\beta$-cocoercive for some $\beta \in ]0,+\infty[$. The problem is to find $z \in \mathcal{K}$
such that

\[ 0 \in (A + B)(z) \tag{2.8} \]

under the assumption that the set of solutions $\mathcal{P}$ of inclusion (2.8) is nonempty.
We recall that an operator $A\colon \mathcal{K} \to 2^{\mathcal{K}}$ is maximally monotone if it is monotone, namely for every
$z_1$ and $z_2$ in $\mathcal{K}$, every $v_1 \in Az_1$ and every $v_2 \in Az_2$, $\langle v_1 - v_2 \mid z_1 - z_2\rangle \ge 0$, and there is no monotone
operator whose graph properly contains the graph of $A$. An operator $B\colon \mathcal{K} \to \mathcal{K}$ is $\beta$-cocoercive if, for
every $z_1$ and $z_2$ in $\mathcal{K}$,

\[ \langle z_1 - z_2 \mid Bz_1 - Bz_2\rangle \ge \beta\,\|Bz_1 - Bz_2\|^2. \]

The imposed structure allows a forward-backward algorithm to be applied to the monotone inclusion
in (2.8). Moreover, if in (2.7) we define

\[ A\colon (w,v) \in \mathcal{H} \times \mathcal{G} \mapsto (\partial f(w) + D^* v,\; -Dw + \partial g^*(v)), \qquad B\colon (w,v) \in \mathcal{H} \times \mathcal{G} \mapsto (\nabla F(w),0), \]

we get that $A$ is maximally monotone since it is the sum of a subdifferential operator (which is
maximally monotone) and a skew operator [H Example 20.30]. Moreover, $B$ is cocoercive by the
Baillon-Haddad theorem, since the gradient is assumed Lipschitz continuous. In the deterministic
case it has been shown that, by properly choosing a metric on the product space $\mathcal{H} \times \mathcal{G}$, different
primal-dual algorithms for solving problem (2.2) can be derived in this way [im ca [H]. Inertial
versions of forward-backward algorithms for monotone inclusions have been considered in [22] and
their convergence has been proved.
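The cocoercivity claim can be checked numerically on a simple instance. The sketch below assumes a least-squares term $F(w) = \frac{1}{2N}\|Xw - y\|^2$ with synthetic data (an illustrative choice, not the paper's experiment). Its gradient is Lipschitz with constant $L = \lambda_{\max}(X^\top X)/N$, so by the Baillon-Haddad theorem it should be $\beta$-cocoercive with $\beta = 1/L$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))      # synthetic design matrix (assumption)
y = rng.normal(size=30)

def grad_F(w):
    """Gradient of F(w) = (1/(2N)) * ||X w - y||^2."""
    return X.T @ (X @ w - y) / len(y)

# Lipschitz constant of grad_F and the corresponding cocoercivity constant.
L = np.linalg.eigvalsh(X.T @ X / len(y)).max()
beta = 1.0 / L

def cocoercivity_gap(z1, z2):
    """<z1 - z2 | B z1 - B z2> - beta * ||B z1 - B z2||^2 for B = grad_F."""
    diff_B = grad_F(z1) - grad_F(z2)
    return (z1 - z2) @ diff_B - beta * (diff_B @ diff_B)

# The gap should be nonnegative (up to rounding) for every pair of points.
gaps = [cocoercivity_gap(rng.normal(size=5), rng.normal(size=5)) for _ in range(1000)]
```

On the product space, $B(w,v) = (\nabla F(w), 0)$ inherits the same constant, since the dual component contributes nothing to either side of the inequality.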

In the following sections we will show how to extend the analysis to the case when we have
access only to a stochastic estimate of the operator $B$, obtaining as a result different stochastic
inertial primal-dual schemes to solve problem (2.2). Key tools in the following sections will be
$(I + A)^{-1}$, which is called the resolvent of $A$ and is defined everywhere and single valued if $A$ is
maximally monotone, and the proximity operator, which is the resolvent of the subdifferential of a
convex function.
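As a concrete instance, the proximity operator of $\gamma\|\cdot\|_1$ is soft-thresholding, and it can be verified to be the resolvent of $\gamma\,\partial\|\cdot\|_1$: if $p = \operatorname{prox}_{\gamma\|\cdot\|_1}(u)$, then $u - p \in \gamma\,\partial\|p\|_1$. The following is a minimal sketch; the function names are hypothetical.

```python
import numpy as np

def prox_l1(u, gamma):
    """Proximity operator of gamma * ||.||_1 (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

def in_scaled_l1_subdifferential(residual, p, gamma, tol=1e-12):
    """Check residual in gamma * d||.||_1(p), coordinate by coordinate.

    Where p_i != 0 the subdifferential is {gamma * sign(p_i)};
    where p_i == 0 it is the interval [-gamma, gamma].
    """
    on_support = np.abs(residual - gamma * np.sign(p)) <= tol
    off_support = np.abs(residual) <= gamma + tol
    return bool(np.all(np.where(p != 0, on_support, off_support)))

u = np.array([3.0, -0.2, 0.7, -1.5])
gamma = 0.5
p = prox_l1(u, gamma)
# Resolvent characterization: p = (I + gamma * d||.||_1)^{-1}(u).
is_resolvent = in_scaled_l1_subdifferential(u - p, p, gamma)
```

The same characterization, with $\partial\|\cdot\|_1$ replaced by a general maximally monotone $A$, is what makes the backward step of the algorithms below well defined.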


3 Stochastic Inertial Forward-backward splitting method for solving monotone inclusions


While stochastic proximal gradient methods have been studied in several papers (see e.g. PEaEii), 
there are only two recent preprints studying convergence of stochastic forward-backward algorithms 
for monotone inclusions HIES]. In this section we take another step in filling the gap between the 
existing analysis in the deterministic setting mm and the one available in the stochastic one. 
More precisely, we deal with stochastic inertial variants with preconditioning. 


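Algorithm 3.1 In the setting of Problem 2.1, let $U\colon \mathcal{K} \to \mathcal{K}$ be a self-adjoint and strongly positive
operator. Let $\varepsilon \in ]0,\min\{1,\beta\|U\|^{-1}\}[$, let $(\gamma_n)_{n\in\mathbb{N}}$ be a sequence in $[\varepsilon,(2-\varepsilon)\beta\|U\|^{-1}]$, and let
$(\alpha_n)_{n\in\mathbb{N}}$ be a sequence in $[0,1-\varepsilon]$. Let $(r_n)_{n\in\mathbb{N}}$ be a $\mathcal{K}$-valued, square integrable random process,
let $w_0$ be a $\mathcal{K}$-valued, square integrable random variable and set $w_{-1} = w_0$. Furthermore, set

\[ (\forall n \in \mathbb{N}) \quad \begin{cases} z_n = w_n + \alpha_n (w_n - w_{n-1}) \\ w_{n+1} = J_{\gamma_n U A}(z_n - \gamma_n U r_n). \end{cases} \tag{3.1} \]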


The first step of the algorithm is the inertial one, where a combination of the last two iterates
is taken. The operator $U$ is a preconditioner. While for general choices of $U$ the resolvent operator
$J_{\gamma_n U A}$ is not computable in closed form, for suitable choices it allows the above-mentioned
primal-dual schemes to be derived. In particular, we will see in the subsequent sections that $U$ will be built
starting from the linear operators $(D_k)_{1\le k\le s}$. When $r_n = Bz_n$, we are back to the deterministic
inertial forward-backward algorithm which has been studied in [22] (see also [26]). Therefore,
Algorithm 3.1 is a preconditioned stochastic inertial forward-backward method. To get convergence
results, we need to impose restrictions on the stochastic approximations of $Bz_n$ and on the choice
of the sequence $(\alpha_n)_{n\in\mathbb{N}}$.
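To make the iteration concrete, here is a minimal sketch of (3.1) in the simplest setting, under assumptions that are not taken from the paper: $U = \mathrm{Id}$, $A = \partial(\lambda\|\cdot\|_1)$, $B = \nabla F$ with $F(w) = \frac{1}{2N}\|Xw - y\|^2$, a constant step $\gamma_n = \beta$, summable inertial parameters $\alpha_n = 1/(n+2)^2$, and an unbiased gradient estimate whose noise standard deviation decays like $1/(n+1)$, so that the variances are summable. The function name and all parameter choices are hypothetical.

```python
import numpy as np

def soft_threshold(u, t):
    """Resolvent of t * d||.||_1, i.e. the proximity operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def inertial_stochastic_fb(X, y, lam, n_iter, noise_scale, rng):
    """Sketch of iteration (3.1) with U = Id for an l1-regularized least-squares problem."""
    N, d = X.shape
    beta = N / np.linalg.eigvalsh(X.T @ X).max()   # cocoercivity constant of grad F
    gamma = beta                                    # constant step inside [eps, (2 - eps) * beta]
    w_prev = w = np.zeros(d)
    for n in range(n_iter):
        alpha = 1.0 / (n + 2) ** 2                  # summable inertial parameters (assumption)
        z = w + alpha * (w - w_prev)                # inertial step
        noise = noise_scale * rng.normal(size=d) / (n + 1)   # summable variance (assumption)
        r = X.T @ (X @ z - y) / N + noise           # unbiased estimate of B z_n
        w_prev, w = w, soft_threshold(z - gamma * r, gamma * lam)  # backward (resolvent) step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=40)
lam = 0.1
w_det = inertial_stochastic_fb(X, y, lam, 3000, 0.0, rng)   # exact gradients
w_sto = inertial_stochastic_fb(X, y, lam, 3000, 1.0, rng)   # noisy gradients
```

With these choices the noisy run ends close to the noiseless one, which is consistent with (though of course no substitute for) the almost sure convergence stated next.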

Theorem 3.2 Consider Algorithm 3.1, and set $(\forall n \in \mathbb{N})$ $\mathcal{F}_n = \sigma(w_0,\dots,w_n)$. Suppose that the
following conditions are satisfied.

(i) $(\forall n \in \mathbb{N})$ $\mathsf{E}[r_n \mid \mathcal{F}_n] = Bz_n$ a.s.

(ii) $\sum_{n\in\mathbb{N}} \mathsf{E}[\|r_n - Bz_n\|^2 \mid \mathcal{F}_n] < +\infty$ a.s.

(iii) $\sup_{n\in\mathbb{N}} \|w_n - w_{n-1}\| < \infty$ a.s. and $\sum_{n\in\mathbb{N}} \alpha_n < +\infty$.

Then, the following hold for some a.s. $\mathcal{P}$-valued random variable $\bar w$.

(i) $w_n \rightharpoonup \bar w$ a.s.

(ii) $Bw_n \to B\bar w$ a.s.

(iii) If $B$ is uniformly monotone at $\bar w$, then $\|w_n - \bar w\| \to 0$ a.s.

Condition (i) means that, for every iteration $n$, $r_n$ is an unbiased estimate of $Bz_n$. Moreover,
condition (ii) requires the variance of the stochastic approximation to decrease, and in particular to
be summable. In principle this may seem a strong condition, but it is needed to derive primal-dual
stochastic algorithms. Indeed, such a derivation requires an analysis of the forward-backward algorithm with
nonvanishing step-size. This is a main difficulty to overcome, since even for minimization
problems of a smooth function ($A = 0$ and $B = \nabla f$ for some function $f$), it is known that almost
sure convergence of the iterates cannot be derived for a fixed step-size when only assuming that the
variance is bounded, namely $\mathsf{E}[\|r_n - Bz_n\|^2 \mid \mathcal{F}_n] \le \sigma^2$, and there are explicit counterexamples
(see e.g. [18] and references therein). On the other hand, a constant step-size could be used with
different stochastic approximations of the gradients, for instance those of lUG methods [36],
see also [20], which indeed use an approximation of the gradient having a smaller variance. In
general we can only obtain weak convergence, as usually happens in infinite dimensional spaces
also for the deterministic implementations. Strong convergence can be obtained only under additional
monotonicity assumptions, which in the case of minimization are related to uniform (or strong)
convexity. The sequence $(\alpha_n)_{n\in\mathbb{N}}$ is required to be summable. Therefore, though the structure of the
algorithm includes a stochastic extension of the well-known Nesterov's accelerated method EZ], the
choice $\alpha_n = (n-1)/(n+2)$ used in the minimization setting is not allowed by our theorem.
Our methods are new even in the case in which $\alpha_n = 0$. In this case there is no inertial step,
and we get the stochastic forward-backward algorithm studied in EH and in [32]. Here we make
different assumptions with respect to both papers. Indeed, the analysis is in the same setting as
in EH, but here we require a weaker condition on summability of the errors. With respect to [32],
we removed the strong monotonicity assumptions on the operators, and a non-vanishing step-size
is allowed, but under a stronger condition on the errors. The proof is based on showing that the
sequence $(w_n)_{n\in\mathbb{N}}$ is stochastic quasi-Fejér monotone [H] with respect to the set of solutions $\mathcal{P}$.


4 Special cases: minimization algorithms 


We show that the results obtained in the previous section for the forward-backward algorithm
can be used to prove convergence of different classes of primal-dual algorithms, as well as
previously known algorithms for solving problem (2.2), and more generally, problem (2.5).


4.1 Preconditioned inertial stochastic forward-backward splitting 


In (2.8), set ^ = d(^ i—)• and = Vi{wk,Ck)- Then, in this case we recover 

the inertial forward-backward splitting algorithm [27], [5]. As mentioned above, the conditions on
$(\alpha_n)_{n\in\mathbb{N}}$ do not allow the standard choices to be made. Convergence in expectation of the objective
function (without preconditioning) has been studied in the stochastic setting by several authors, see
e.g. jzudnid]. We underline that a suitable preconditioning can significantly improve convergence
results.


4.2 First class of primal-dual stochastic algorithms 


This class of algorithms can be seen as an inertial version of an extension to the stochastic setting 
of the primal-dual deterministic algorithms studied in |38l [T^ for solving problem (2.5). 


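Algorithm 4.1 For every $k \in \{1,\dots,s\}$, let $W_k\colon \mathcal{G}_k \to \mathcal{G}_k$ and $V\colon \mathcal{H} \to \mathcal{H}$ be self-adjoint and
strongly positive. Let $\varepsilon \in ]0,1[$, let $(\alpha_n)_{n\in\mathbb{N}}$ be a sequence in $[0,1-\varepsilon]$. Let $(a_n)_{n\in\mathbb{N}}$ be an $\mathcal{H}$-valued,
square integrable random process, let $w_0$ be an $\mathcal{H}$-valued, square integrable random vector, and
set $w_{-1} = w_0$. Let $v_0$ be a $\mathcal{G}$-valued, square integrable random vector and set $v_{-1} = v_0$. Then,
iterate, for every $n \in \mathbb{N}$,

\[ \begin{array}{l} u_n = w_n + \alpha_n (w_n - w_{n-1}) \\ \text{for } k = 1,\dots,s \\ \quad d_{k,n} = v_{k,n} + \alpha_n (v_{k,n} - v_{k,n-1}) \\ \quad v_{k,n+1} = \operatorname{prox}^{W_k^{-1}}_{g_k^*}\big(d_{k,n} + W_k D_k (u_n - 2V(\sum_{j=1}^{s} D_j^* d_{j,n} + a_n))\big) \\ w_{n+1} = \operatorname{prox}^{V^{-1}}_{f}\big(u_n - V(\sum_{k=1}^{s} D_k^* d_{k,n} + a_n)\big). \end{array} \]

In the special case when $V = \tau\,\mathrm{Id}$ and, for every $k \in \{1,\dots,s\}$, $W_k = \sigma_k\,\mathrm{Id}$, $\alpha_n = 0$ for every
$n \in \mathbb{N}$, and the errors are not stochastic, Algorithm 4.1 recovers the algorithm studied in
[38] and similar algorithms in HU. It can be immediately seen that each proximity operator is
activated individually and no inversion of the linear operator $D$ is required.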
(4.2)


Theorem 4.2 In the setting of Algorithm 4.1, assume that

7 = (i - (E ^ 

k=l 

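and $\varepsilon < \min\{1,\gamma\}$. Suppose that the following conditions are satisfied:

(i) $(\forall n \in \mathbb{N})$ $\mathsf{E}[a_n \mid \mathcal{F}_n] = \nabla F(u_n)$.

(ii) $\sum_{n\in\mathbb{N}} \mathsf{E}[\|a_n - \nabla F(u_n)\|^2 \mid \mathcal{F}_n] < +\infty$.

(iii) $\sup_{n\in\mathbb{N}} \|w_n - w_{n-1}\| < \infty$ a.s., $\max_{1\le k\le s} \sup_{n\in\mathbb{N}} \|v_{k,n} - v_{k,n-1}\| < \infty$ a.s., and
$\sum_{n\in\mathbb{N}} \alpha_n < +\infty$.

Then the following hold for some random vector $(\bar w,\bar v_1,\dots,\bar v_s)$, $\mathcal{P} \times \mathcal{D}$-valued almost surely.

(i) $w_n \rightharpoonup \bar w$ and $(\forall k \in \{1,\dots,s\})$ $v_{k,n} \rightharpoonup \bar v_k$ almost surely.

(ii) Suppose that the function $F$ is uniformly convex at $\bar w$ almost surely. Then $w_n \to \bar w$ almost
surely.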


The proof of Theorem 4.2, whose sketch can be found in the appendix, starts from the
observation that Algorithm 4.1 is an inertial stochastic forward-backward algorithm. This algorithm is
applied in $\mathcal{H} \times \mathcal{G}$, with $A$ and $B$ as in (2.7), and preconditioning operator $U$, which is defined as the
inverse of the linear operator from $\mathcal{H} \times \mathcal{G}$ to $\mathcal{H} \times \mathcal{G}$, $(w,v) \mapsto (V^{-1} w - D^* v,\, (W_k^{-1} v_k - D_k w)_{1\le k\le s})$.
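The forward-backward form with this preconditioner can be unrolled into explicit updates. The following is a hedged sketch, derived here from the operator $(w,v) \mapsto (V^{-1}w - D^*v, W^{-1}v - Dw)$ rather than copied from the paper, for the toy problem of minimizing $\frac12\|w - y\|^2 + \lambda\|Dw\|_1$ with $f = 0$, $s = 1$, $V = \tau\,\mathrm{Id}$ and $W_1 = \sigma\,\mathrm{Id}$. In this case the dual step uses $2w_{n+1} - u_n$, which for $f = 0$ equals $u_n - 2V(D^* d_n + a_n)$. The function name, step sizes, noise model and inertial parameters are assumptions.

```python
import numpy as np

def primal_dual_tv(y, lam, tau, sigma, n_iter, noise_scale=0.0, seed=0):
    """Inertial stochastic primal-dual sketch for 0.5*||w - y||^2 + lam*||D w||_1."""
    rng = np.random.default_rng(seed)
    d = len(y)
    D = np.diff(np.eye(d), axis=0)           # (D w)_j = w_{j+1} - w_j
    w_prev = w = np.zeros(d)
    v_prev = v = np.zeros(d - 1)
    for n in range(n_iter):
        alpha = 1.0 / (n + 2) ** 2           # summable inertial parameters (assumption)
        u = w + alpha * (w - w_prev)         # inertial step, primal variable
        dual = v + alpha * (v - v_prev)      # inertial step, dual variable
        # Noisy estimate of grad F(u_n) with F = 0.5*||. - y||^2; noise std decays like 1/(n+1).
        a = (u - y) + noise_scale * rng.normal(size=d) / (n + 1)
        w_next = u - tau * (D.T @ dual + a)  # primal step (prox of f = 0 is the identity)
        # The prox of the conjugate of lam*|.| is the projection onto [-lam, lam].
        v_next = np.clip(dual + sigma * (D @ (2.0 * w_next - u)), -lam, lam)
        w_prev, w, v_prev, v = w, w_next, v, v_next
    return w, v

y = np.array([0.0, 1.0, 2.0, 3.0])
# Step sizes chosen so that 1/tau - sigma*||D||^2 exceeds 1/(2*beta) with beta = 1 (assumption).
w_det, v_det = primal_dual_tv(y, lam=5.0, tau=0.5, sigma=0.25, n_iter=3000)
w_sto, v_sto = primal_dual_tv(y, lam=5.0, tau=0.5, sigma=0.25, n_iter=3000, noise_scale=0.1)
```

For this data the regularization is strong enough that the minimizer is the constant vector equal to the mean of `y`, which gives a simple sanity check. No linear system involving $D$ is solved at any point: only $D$, its transpose, and a coordinate-wise projection are applied.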


Remark 4.3 Uniform convexity of $F$, which is an expectation, follows from uniform convexity of
the loss function with respect to the second variable. More precisely, let $\bar w \in \mathcal{H}$. Suppose that
there exists $\phi\colon [0,+\infty[ \to [0,+\infty]$ increasing and vanishing only at $0$ such that, for every $\xi \in \Xi$
and for every $w \in \mathcal{H}$,

\[ \ell(\xi,w) \ge \ell(\xi,\bar w) + \langle \nabla \ell(\xi,\bar w) \mid w - \bar w\rangle + \phi(\|w - \bar w\|). \]

Then $F$ is uniformly convex at $\bar w$ with modulus $\phi$.
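The step from the pointwise inequality to the statement on $F$ is an integration with respect to $\xi$. A short sketch, assuming that all the expectations involved are finite and that gradient and expectation can be exchanged, i.e. $\nabla F(\bar w) = \mathsf{E}[\nabla \ell(\xi,\bar w)]$ (an assumption that the remark does not state explicitly):

```latex
\begin{align*}
F(w) = \mathsf{E}[\ell(\xi,w)]
  &\ge \mathsf{E}[\ell(\xi,\bar w)]
     + \big\langle \mathsf{E}[\nabla \ell(\xi,\bar w)] \,\big|\, w - \bar w \big\rangle
     + \phi(\|w - \bar w\|) \\
  &= F(\bar w) + \langle \nabla F(\bar w) \mid w - \bar w\rangle + \phi(\|w - \bar w\|).
\end{align*}
```

The term $\phi(\|w - \bar w\|)$ does not depend on $\xi$, so it passes through the expectation unchanged.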


Stochastic inertial Chambolle-Pock algorithm. In the special case when $s = 1$, $\ell = 0$,
$V = \tau\,\mathrm{Id}$ and $W_1 = \sigma\,\mathrm{Id}$, Algorithm 4.1 is an inertial variant of Algorithm 1 in [8], which can be
recovered by setting $\alpha_n = 0$. Since the second inequality in (4.2) is always satisfied (in this case $\beta$
can be chosen arbitrarily small), the conditions on the step-size reduce to

\[ \tau\sigma\|D_1\|^2 < 1. \]

Weak convergence of the iterates obtained here does not follow from the analysis in [8] for Algorithm
2, where the assumptions on the sequence $(\alpha_n)_{n\in\mathbb{N}}$ are the typical ones for accelerated methods. A
related algorithm, the so-called PDHG, has been studied in [IQIIIZ!; it is a deterministic version
of the above algorithm, and corresponds to the case $\alpha_n = 0$ and $\ell = 0$. Finally, a preconditioned
version of the primal-dual Algorithm 1 in [8] has been studied in [29], where the conditions on the
preconditioning matrices correspond to the ones in (4.2).








4.3 Second class 


In this section we suppose $f = 0$ in (2.2).


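Algorithm 4.4 Let $V\colon \mathcal{H} \to \mathcal{H}$ be a bounded linear self-adjoint and strongly positive operator.
For every $k \in \{1,\dots,s\}$, let $W_k\colon \mathcal{G}_k \to \mathcal{G}_k$ be linear, bounded, self-adjoint, and strongly positive.
Let $\varepsilon \in ]0,1[$, let $(\lambda_n)_{n\in\mathbb{N}}$ be a sequence in $[\varepsilon,1]$, and let $(\alpha_n)_{n\in\mathbb{N}}$ be a sequence in $[0,1-\varepsilon]$. Let
$(a_n)_{n\in\mathbb{N}}$ be an $\mathcal{H}$-valued, square integrable random process, let $w_0$ be an $\mathcal{H}$-valued, square
integrable random vector and set $w_{-1} = w_0$. Let $v_0$ be a $\mathcal{G}$-valued, square integrable random
vector and set $v_{-1} = v_0$. Then, iterate, for every $n \in \mathbb{N}$,

\[ \begin{array}{l} u_n = w_n + \alpha_n (w_n - w_{n-1}) \\ \text{for } k = 1,\dots,s \\ \quad d_{k,n} = v_{k,n} + \alpha_n (v_{k,n} - v_{k,n-1}) \\ s_n = u_n - V a_n - V \sum_{k=1}^{s} D_k^* d_{k,n} \\ \text{for } k = 1,\dots,s \\ \quad q_{k,n} = \operatorname{prox}^{W_k^{-1}}_{g_k^*}(d_{k,n} + W_k D_k s_n) \\ \quad v_{k,n+1} = v_{k,n} + \lambda_n (q_{k,n} - v_{k,n}) \\ w_{n+1} = u_n - V a_n - V \sum_{k=1}^{s} D_k^* q_{k,n}. \end{array} \tag{4.3} \]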

Theorem 4.5 In the setting of Algorithm \4-^ let ft be a strictly positive number such that ([23 
is satisfied. Assume that Ylk=i < 1, that /3||I/||“^ > 1/2, and that e < min{l, /3}. 

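Set $\mathcal{F}_n = \sigma((w_0,v_0),\dots,(w_n,v_n))$ and suppose that the following conditions are satisfied:

(i) $(\forall n \in \mathbb{N})$ $\mathsf{E}[a_n \mid \mathcal{F}_n] = \nabla F(u_n)$.

(ii) $\sum_{n\in\mathbb{N}} \mathsf{E}[\|a_n - \nabla F(u_n)\|^2 \mid \mathcal{F}_n] < +\infty$.

(iii) $\sup_{n\in\mathbb{N}} \|w_n - w_{n-1}\| < \infty$ a.s., $\max_{1\le k\le s} \sup_{n\in\mathbb{N}} \|v_{k,n} - v_{k,n-1}\| < \infty$ a.s., and
$\sum_{n\in\mathbb{N}} \alpha_n < +\infty$.

Then the following hold for some random vector $(\bar w,\bar v_1,\dots,\bar v_s)$, $\mathcal{P} \times \mathcal{D}$-valued almost surely.

(i) $w_n \rightharpoonup \bar w$ and $(\forall k \in \{1,\dots,s\})$ $v_{k,n} \rightharpoonup \bar v_k$ almost surely.

(ii) Suppose that the function $F$ is uniformly convex at $\bar w$. Then $w_n \to \bar w$ almost surely.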


Generalized forward-backward for nonseparable penalties. Algorithm 4.4 generalizes in several
respects the algorithm in [23, equation (24)]. Indeed, here we presented a
convergence analysis for a more general objective function, adding stochastic noise and an inertial
step. Moreover, Algorithm 4.4 is a stochastic and inertial version of the algorithm in m Proposition 4.3].
A special case of Algorithm 4.4 has been proposed in j^, where $s = 1$, $V = \tau\,\mathrm{Id}$, and
$W_1 = \mathrm{Id}$.









5 Numerical experiments 


Let $N$ and $p$ be strictly positive integers. Concerning the data generation protocol, the input
points $(x_i)_{1\le i\le N}$ are uniformly drawn in the interval $[a,b]$ (to be specified later in the two cases
we consider). For a suitably chosen finite dictionary of real valued functions $(\phi_k)_{1\le k\le p}$ defined on
$[a,b]$, the labels are computed using a noise-corrupted regression function, namely

\[ (\forall i \in \{1,\dots,N\}) \quad y_i = \sum_{k=1}^{p} \bar w_k \phi_k(x_i) + \epsilon_i, \tag{5.1} \]

where $(\bar w_k)_{1\le k\le p} \in \mathbb{R}^p$ and $\epsilon_i$ is an additive noise $\sim \mathcal{N}(0,0.3)$.

We will consider a polynomial dictionary of functions, i.e. $(\forall k \in \{1,\dots,p\})$ $\phi_k\colon [-1,1] \to \mathbb{R}$,
$\phi_k(x) = x^{k-1}$. We estimate $\bar w$ by solving the following regularized minimization problem


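\[ \underset{w\in\mathbb{R}^p}{\text{minimize}}\;\; \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{k=1}^{p} w_k \phi_k(x_i) - y_i\Big)^2 + \lambda \sum_{l=1}^{s}\Big(\sum_{j\in G_l} w_j^2\Big)^{1/2}, \tag{5.2} \]

where $\lambda$ is a strictly positive parameter. Problem (5.2) is a special case of Problem (2.2) and hence it
can be solved by using the stochastic inertial forward-backward splitting (first class). We set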


p = 32, s = 8, N = 4:8, 7 „ = 15/(n-|-100), an = ^\, A = 0.02, 

w = [3,2,1,0,1,0,1, 2,-1,0,0,-2,-1,1,0.5, 0,1, 0,4,0,-2,0,0,-2,1.0,1,0, 0.2,-0.1,0,0, 
(V/ G 8}) Gi = [4/-3,...,41 + 1] 


1 ] 

(5.3) 


Here, we use a perturbation of the exact gradient as the stochastic gradient, namely

\[ a_n = \nabla F(u_n) + \mathcal{N}(0,\mathrm{Sig})/n. \tag{5.4} \]
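The data generation protocol and the noisy gradient in (5.4) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' experimental code: the value 0.3 in $\mathcal{N}(0,0.3)$ is read as a variance, the data-fit term is normalized by $1/N$, the covariance "Sig" is supplied through a square-root matrix `cov_sqrt`, and a short hypothetical coefficient vector is used in place of the 32-dimensional one. All function names are hypothetical.

```python
import numpy as np

def polynomial_dictionary(x, p):
    """Matrix Phi with Phi[i, k-1] = x_i ** (k - 1), k = 1, ..., p."""
    return np.vander(x, N=p, increasing=True)

def make_data(w_bar, n_samples, noise_std, rng):
    """Labels y_i = sum_k w_bar_k * phi_k(x_i) + eps_i, with x_i uniform on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n_samples)
    Phi = polynomial_dictionary(x, len(w_bar))
    y = Phi @ w_bar + noise_std * rng.normal(size=n_samples)
    return x, Phi, y

def noisy_gradient(Phi, y, u, n, cov_sqrt, rng):
    """Exact gradient of (1/N) * ||Phi u - y||^2 plus Gaussian noise scaled by 1/n."""
    exact = 2.0 * Phi.T @ (Phi @ u - y) / len(y)
    return exact + cov_sqrt @ rng.normal(size=len(u)) / n

rng = np.random.default_rng(0)
w_bar = np.array([1.0, 0.0, -2.0, 0.5])          # hypothetical short coefficient vector
x, Phi, y = make_data(w_bar, 48, np.sqrt(0.3), rng)
exact_grad = 2.0 * Phi.T @ (Phi @ w_bar - y) / len(y)
late_grad = noisy_gradient(Phi, y, w_bar, n=10**6, cov_sqrt=np.eye(4), rng=rng)
```

Because the perturbation is divided by $n$, its variance decays like $1/n^2$ and is therefore summable, which is the kind of condition required in Theorem 4.2 (ii).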


The resulting regression functions using the stochastic inertial primal-dual splitting (SIPDS) are
shown in Figure 1 (right). To check convergence towards a solution of (5.2), we computed a solution
of (5.2) by running the corresponding deterministic primal-dual splitting method in [38] for 5000
iterations.

Figure 1: Convergence of the iterates of SIPDS applied to Problem (5.2) (left), and corresponding
approximations of regression functions (right).
A Proofs


Proof of Theorem 3.2. Since $U$ is self-adjoint and strongly positive, $UA$ is also maximally
monotone by [12, Lemma 3.7]. Since $B$ is cocoercive and has full domain, it is also
maximally monotone [U Corollary 20.25]. Let $w \in \mathcal{P}$ and set

\[ (\forall n \in \mathbb{N}) \quad u_n = z_n - w_{n+1} - \gamma_n U(r_n - Bw). \tag{A.1} \]

Then, we have

\[ (\forall n \in \mathbb{N}) \quad w = J_{\gamma_n U A}(w - \gamma_n U B w). \tag{A.2} \]

We derive from [m Lemma 3.7] that $J_{\gamma_n U A}$ is firmly nonexpansive with respect to the norm
$\|\cdot\|_V$, therefore

\begin{align*} (\forall n \in \mathbb{N}) \quad \|w_{n+1} - w\|_V^2 &\le \|z_n - w - \gamma_n U(r_n - Bw)\|_V^2 - \|u_n\|_V^2 \\ &= \|z_n - w\|_V^2 - 2\gamma_n \langle z_n - w \mid r_n - Bw\rangle + \gamma_n^2 \|U(r_n - Bw)\|_V^2 - \|u_n\|_V^2. \tag{A.3} \end{align*}

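By (i), since $z_n$ is $\mathcal{F}_n$-measurable, we have

\[ (\forall n \in \mathbb{N}) \quad \mathsf{E}[\langle z_n - w \mid r_n - Bw\rangle \mid \mathcal{F}_n] = \langle z_n - w \mid Bz_n - Bw\rangle. \tag{A.4} \]

For the same reason, for every $n \in \mathbb{N}$, since $Bz_n$ is $\mathcal{F}_n$-measurable, we also have

\begin{align*} \mathsf{E}[\|U(r_n - Bw)\|_V^2 \mid \mathcal{F}_n] &= \mathsf{E}[\|U(r_n - Bz_n)\|_V^2 \mid \mathcal{F}_n] + \|U(Bz_n - Bw)\|_V^2 \\ &\quad + 2\,\mathsf{E}[\langle U(Bz_n - Bw) \mid r_n - Bz_n\rangle \mid \mathcal{F}_n] \\ &= \mathsf{E}[\|U(r_n - Bz_n)\|_V^2 \mid \mathcal{F}_n] + \|U(Bz_n - Bw)\|_V^2 \\ &\le \mathsf{E}[\|U(r_n - Bz_n)\|_V^2 \mid \mathcal{F}_n] + \|U\|\beta^{-1} \langle z_n - w \mid Bz_n - Bw\rangle, \tag{A.5} \end{align*}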


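where the last inequality follows from cocoercivity of $B$. Therefore, for every $n \in \mathbb{N}$, we derive
from (A.3), (A.4) and (A.5) that

\begin{align*} \mathsf{E}[\|w_{n+1} - w\|_V^2 \mid \mathcal{F}_n] &\le \|z_n - w\|_V^2 - \varepsilon\gamma_n \langle z_n - w \mid Bz_n - Bw\rangle \\ &\quad + \gamma_n^2\, \mathsf{E}[\|U(r_n - Bz_n)\|_V^2 \mid \mathcal{F}_n] - \mathsf{E}[\|u_n\|_V^2 \mid \mathcal{F}_n] \\ &\le \|w_n - w\|_V^2 + \alpha_n(\|w_n - w\|_V^2 - \|w_{n-1} - w\|_V^2) + \zeta_n - \xi_n \tag{A.6} \\ &= (1 + \alpha_n)\|w_n - w\|_V^2 + \zeta_n - (\alpha_n \|w_{n-1} - w\|_V^2 + \xi_n), \tag{A.7} \end{align*}

with

\begin{align*} \zeta_n &= \alpha_n(1 + \alpha_n)\|w_n - w_{n-1}\|_V^2 + \gamma_n^2\, \mathsf{E}[\|U(r_n - Bz_n)\|_V^2 \mid \mathcal{F}_n], \\ \xi_n &= \mathsf{E}[\|u_n\|_V^2 \mid \mathcal{F}_n] + \varepsilon\gamma_n \langle z_n - w \mid Bz_n - Bw\rangle. \end{align*}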

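Note that, for each $n \in \mathbb{N}$, $\zeta_n$ and $\xi_n$ are non-negative and $\mathcal{F}_n$-measurable. Moreover, $(\zeta_n)_{n\in\mathbb{N}}$ is
summable, and hence, we derive from [m Theorem 1] that

\[ \exists \lim_{n\to\infty} \|w_n - w\|_V \quad \text{and} \quad \sum_{n\in\mathbb{N}} (\alpha_n \|w_{n-1} - w\|_V^2 + \xi_n) < +\infty. \tag{A.8} \]

Moreover, since $\inf_{n\in\mathbb{N}} \gamma_n > 0$, we have

\[ \sum_{n\in\mathbb{N}} \langle z_n - w \mid Bz_n - Bw\rangle < +\infty \;\Rightarrow\; \langle z_n - w \mid Bz_n - Bw\rangle \to 0 \tag{A.9} \]

and

\[ \sum_{n\in\mathbb{N}} \mathsf{E}[\|u_n\|_V^2 \mid \mathcal{F}_n] < +\infty \;\Rightarrow\; \mathsf{E}[\|z_n - w_{n+1} - \gamma_n U(r_n - Bw)\|^2 \mid \mathcal{F}_n] \to 0. \tag{A.10} \]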
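Next, from the cocoercivity of $B$, we derive from (A.9) that

\[ Bz_n \to Bw, \tag{A.11} \]

and we also derive from (A.10) and (A.11), and condition (ii) in the statement, that

\begin{align*} \mathsf{E}[\|z_n - w_{n+1}\|^2 \mid \mathcal{F}_n] &\le 2\,\mathsf{E}[\|z_n - w_{n+1} - \gamma_n U(r_n - Bw)\|^2 \mid \mathcal{F}_n] + 2\,\mathsf{E}[\|\gamma_n U(r_n - Bw)\|^2 \mid \mathcal{F}_n] \\ &\le 2\big(\mathsf{E}[\|z_n - w_{n+1} - \gamma_n U(r_n - Bw)\|^2 \mid \mathcal{F}_n] + 2\,\mathsf{E}[\|\gamma_n U(r_n - Bz_n)\|^2 \mid \mathcal{F}_n] \\ &\qquad + 2\|\gamma_n U(Bz_n - Bw)\|^2\big) \to 0. \tag{A.12} \end{align*}

Hence, by condition (ii) and (A.11), we obtain

\[ \mathsf{E}[\|r_n - Bw\|^2 \mid \mathcal{F}_n] \to 0. \tag{A.13} \]

Now define

\[ (\forall n \in \mathbb{N}) \quad \tilde w_{n+1} = J_{\gamma_n U A}(z_n - \gamma_n U B z_n). \tag{A.14} \]

Then $\tilde w_{n+1}$ is $\mathcal{F}_n$-measurable since $J_{\gamma_n U A} \circ (\mathrm{Id} - \gamma_n U B)$ is continuous. Therefore,

\begin{align*} (\forall n \in \mathbb{N}) \quad \|z_n - \tilde w_{n+1}\|_V^2 &= \mathsf{E}[\|z_n - \tilde w_{n+1}\|_V^2 \mid \mathcal{F}_n] \\ &\le 2\,\mathsf{E}[\|w_{n+1} - z_n\|_V^2 \mid \mathcal{F}_n] + 2\,\mathsf{E}[\|\gamma_n U(r_n - Bz_n)\|_V^2 \mid \mathcal{F}_n] \to 0. \tag{A.15} \end{align*}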


(i): Now, let $\bar w$ be a weak cluster point of $(w_n)_{n\in\mathbb{N}}$, i.e., there exists a subsequence $(w_{k_n})_{n\in\mathbb{N}}$
which converges weakly to $\bar w$. It follows from our assumption that $(z_{k_n})_{n\in\mathbb{N}}$ converges weakly to $\bar w$.
By (A.15), $(\tilde w_{k_n+1})_{n\in\mathbb{N}}$ converges weakly to $\bar w$. On the other hand, since $B$ is maximally monotone
and its graph is therefore sequentially closed in $\mathcal{K}^{\mathrm{weak}} \times \mathcal{K}^{\mathrm{strong}}$ [H Proposition 20.33(ii)], by (A.11),
$B\bar w = Bw$. By definition of the resolvent operator, we have

\[ \frac{U^{-1}(z_{k_n} - \tilde w_{k_n+1})}{\gamma_{k_n}} - Bz_{k_n} \in A\tilde w_{k_n+1}, \tag{A.16} \]

and hence, using the sequential closedness of the graph of $A$ in $\mathcal{K}^{\mathrm{weak}} \times \mathcal{K}^{\mathrm{strong}}$ [H Proposition 20.33(ii)],
we get $-B\bar w \in A\bar w$ or equivalently, $\bar w \in (A + B)^{-1}(\{0\})$. Therefore, every weak
cluster point of $(w_n)_{n\in\mathbb{N}}$ is in $(A + B)^{-1}(\{0\})$, which is non-empty, closed and convex [H Proposition 23.39].
By [TTl Theorem 1], $(w_n)_{n\in\mathbb{N}}$ converges weakly to a random vector $\bar w$, taking values in
$(A + B)^{-1}(\{0\})$ almost surely.


(ii) From the cocoercivity of $B$, for every $n$ in $\mathbb{N}$,

\[ \|Bw_n - Bz_n\| \le \beta^{-1}\|w_n - z_n\| = \beta^{-1}\alpha_n \|w_n - w_{n-1}\| \to 0 \tag{A.17} \]

by (A.8). By (A.11), we obtain $Bw_n \to B\bar w$.

(iii) This conclusion follows from (ii), since strong monotonicity implies demiregularity [21, Definition 2.3]. □

Next we give a sketch of the proof of Theorem 4.2.

Proof of Theorem 4.2.
Let $\mathcal{K} = \mathcal{H} \times \mathcal{G}$, and define $A$ and $B$ as in (2.7). Define $W\colon \mathcal{G} \to \mathcal{G}$ by setting
$W(v_1,\dots,v_s) = (W_1 v_1,\dots,W_s v_s)$. Let $U'\colon \mathcal{K} \to \mathcal{K}$ be the linear operator defined by setting
$(w,v) \mapsto (V^{-1} w - D^* v,\; W^{-1} v - Dw)$. Since $\|\sqrt{W} D \sqrt{V}\| < 1$ by assumption, proceeding as in
[28, Lemma 4.3(i) and Lemma 4.9(i)], we get that $U'$ is strongly positive and self-adjoint. Therefore, its inverse,
denoted by $U$, is also strongly positive and self-adjoint. Since $B\colon (w,v) \mapsto (\nabla F(w),0)$, and
$\nabla F$ is $\beta$-cocoercive, it follows that $B$ is $\beta\|V\|^{-1}$-cocoercive in the norm induced by $V$. By [28,
Lemma 4.3(ii)] we also derive that $B$ is cocoercive in the norm induced by $U$ with cocoercivity
constant $\gamma = (1 - \|\sqrt{W} D \sqrt{V}\|)\beta\|V\|^{-1}$. The statement follows by noting that Algorithm 4.1 can
be equivalently written as

\[ (\forall n \in \mathbb{N}) \quad \begin{cases} (u_n,d_n) = (w_n,v_n) + \alpha_n((w_n,v_n) - (w_{n-1},v_{n-1})) \\ (w_{n+1},v_{n+1}) = J_{UA}((u_n,d_n) - U(a_n,0)) \end{cases} \tag{A.18} \]

and all the assumptions of Theorem 3.2 are satisfied. □

Finally, we also present the key steps to prove Theorem 4.5. The proof follows the same lines as that of Theorem 4.2.


Proof of Theorem 4.5. Let $\mathcal{K} = \mathcal{H} \times \mathcal{G}$, and define $A$ and $B$ as in (2.7). Define $W\colon \mathcal{G} \to \mathcal{G}$
by setting $W(v_1,\dots,v_s) = (W_1 v_1,\dots,W_s v_s)$. Let $T\colon \mathcal{K} \to \mathcal{K}\colon (w,v) \mapsto (Vw,\; (W^{-1} - D V D^*)^{-1} v)$.
Then $T$ is strongly positive and self-adjoint. Algebraic manipulations then show that with this
choice we can express Algorithm 4.4 as

\[ (\forall n \in \mathbb{N}) \quad \begin{cases} (u_n,d_n) = (w_n,v_n) + \alpha_n((w_n,v_n) - (w_{n-1},v_{n-1})) \\ (w_{n+1},v_{n+1}) = J_{TA}((u_n,d_n) - T(a_n,0)), \end{cases} \tag{A.19} \]

which is a special instance of iteration (3.1), with $(\forall n \in \mathbb{N})$ $\gamma_n = 1 \in ]\varepsilon,(2-\varepsilon)\beta\|T\|^{-1}[$. □